There was a forest full of forked roads, trees, hedges, and people too. Thus a story begins with the situation, a context. That’s the place where data resides. But life gets complicated as the hedges are cut down, people must make decisions with trade-offs about the roads and trees, and the weather turns bad. Questions arise, the environment and people respond. We have a very (generalized!) story brewing here.
Multi-level models are story-telling devices in their own right and every story keeps the flow, the data, the characters, what the characters do to one another, all as simple as possible. We pool data and shrink the number of parameters to build models both amnesic and a little agnostic. We will provide the extra memory and causal cautionary tales. One of the most important characteristics of any story is that one aspect of the story leaves you panting for another segment.
Pooling of information across aspects, features, segments of a story will help us understand and participate in the whole story. No pooling often leaves us wondering what our names are after hearing a story, let alone what the story is about. Full pooling leaves nothing to the imagination. Partial pooling allows story components to share enough information to bring all of the story together into a unity.
Imagine a cluster of customer observations from different markets. In this context pooling means using the information from other markets to inform our estimates for each market. A model with no pooling means that each market is the first market that we have ever seen, as other markets have no effect on our estimates. Models without pooling have amnesia, if they ever remembered at all.
As a consequence pooling produces shrinkage of parameter estimates as each market’s ensemble of estimates will gravitate with a centripetal force of attraction to the inertial center of the global mean across markets. Nice concepts, but how do multilevel models do this? Do small market sample sizes share more or less information than large sample size markets? We have much to ponder.
Simulations allow us to know the true, so-called population, parameters of variates simply because we set it up that way. Because we are in charge of the data, we can check whether our design actually gets estimated correctly in a model.
In a multilevel model, the parameters for each market derive from the same pool: a common statistical population of parameters. An example is the family of varying intercepts, one for each market. As the model samples hypotheses for each market's parameter, that parameter learns from the sampling of the parameters of the other markets drawn from the same pool.
The sampling of each parameter complements the sampling, and the cross-market learning that results, of all the other parameters in the other markets. We thus have an adaptively regularized family of parameters. It is adaptive in that each parameter learns from the sampling, the learning, that happens in the other parameters. It is regularized in that the parameters effectively shrink in number and in size toward the pool from which they are sampled. The amount of shrinkage is inversely related to the variation, estimated as compatible with the data, of the distribution of the family of parameters: the smaller the estimated spread, the stronger the pull toward the pooled mean. The most influenced parameters will be those that come from markets with small sample sizes.
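This pull toward the pool can be sketched with the precision-weighting formula from the simpler normal-normal case. This is a toy analogue for intuition only, not the binomial model we fit below, and it assumes the pool's mean mu and spread tau are already known rather than estimated:

```r
# Toy sketch of shrinkage as precision weighting (normal-normal analogue).
# A market's raw estimate is pulled toward the grand mean mu; the pull is
# stronger when the market contributes few observations.
shrink_toward_pool <- function(y_bar, n, sigma_obs, mu, tau) {
  w_data <- n / sigma_obs^2   # precision of the market's own data
  w_pool <- 1 / tau^2         # precision of the population distribution
  (w_data * y_bar + w_pool * mu) / (w_data + w_pool)
}

mu <- 1.5; tau <- 1.5; sigma_obs <- 2
# same raw market mean of 3, different sample sizes
shrink_toward_pool(y_bar = 3, n = 5,  sigma_obs, mu, tau)  # pulled noticeably toward 1.5
shrink_toward_pool(y_bar = 3, n = 40, sigma_obs, mu, tau)  # barely moves from 3
```

The weights are just inverse variances, so a small-sample market, with its noisy mean, cedes more ground to the pool.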
However, it is one thing to have some intuition and another to really understand something. When it comes to statistics, I am a big fan of simulation. Thankfully, Richard does precisely this in chapter 12. Let’s simulate a model to visualize both pooling and shrinking.
We simulate the number of customers who transacted at each of several markets. That is, market \(i\) has \(S_i\) customers who transacted, out of a maximum of \(N_i\). The model then is the following:
\[ \begin{align} S_i &\sim Binomial(N_i, p_i), \\ logit(p_i) &= \alpha_{market_{[i]}}, \\ \alpha_j &\sim Normal(\overline{\alpha}, \sigma), \\ \overline{\alpha} &\sim Normal(0, 1.5), \\ \sigma &\sim Exponential(1) \end{align} \]
And, we know that we could also have a volatility model involved here as well. We also know that such a specification should also thicken or thin the posterior distribution tails. But we leave that for another simulation.
We have customers in markets. What do customers do? They visit the market and they sometimes transact for goods and services. We assume a distribution for the average log-odds of actually transacting for each market.
\[ \alpha_j \sim Normal(\bar{\alpha}, \sigma). \]
In this simulation the prior for each intercept will be a distribution that we will simultaneously learn as we learn the individual parameters. We have hyper-priors: priors for the parameters of the distribution of intercepts (\(\overline{\alpha}, \sigma\)).
Quite a lot to pull into the analysis. So an analyst must:
Not only state assumptions in priors
But also, state further, layered or multi-level, assumptions about a distribution-reverting mechanism
Is this too much for an analyst to do? We think not.
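One way to check that these layered assumptions are not too much is a quick prior predictive simulation: draw the hyper-parameters from their priors, then draw intercepts, then look at the implied probabilities of transacting. A minimal sketch; the helper inv_logit is just the logistic function, defined locally so the snippet stands alone:

```r
set.seed(42)
inv_logit <- function(z) 1 / (1 + exp(-z))

n_sims <- 1e4
a_bar_prior <- rnorm(n_sims, mean = 0, sd = 1.5)  # a_bar ~ Normal(0, 1.5)
sigma_prior <- rexp(n_sims, rate = 1)             # sigma ~ Exponential(1)
alpha_prior <- rnorm(n_sims, mean = a_bar_prior, sd = sigma_prior)

# implied prior distribution over the probability of transacting
p_prior <- inv_logit(alpha_prior)
summary(p_prior)  # spreads over (0, 1) rather than piling all mass at the extremes
```

If the implied probabilities had concentrated at 0 and 1, we would revisit the hyper-priors before ever touching data.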
To simulate this model, we define the parameters of the distribution of intercepts. For each market, we will simulate an average log-odds of transacting. Using the average log-odds of transacting we will simulate the number of customers in each market who transacted using the binomial bag of beans.
Epistemology answers the question: what is it to know? Our answer is some sort of probable inference. Ontology answers the question: given that we know, what is it that we know? Neither hyper-parameters nor parameters are in this schema. In a sense only that which is experienced, observed, sensed, is in the simulation of variates. Hyper-parameters, parameters and so on are considered by some to be figments of the furtive imaginations of analysts who make decisions about the relationships of parameters, otherwise known as a model. The model has a purpose, materials, a design, and an agency. It also possesses something of an objectivity, which in our case is the setting up of the data to test the model that emanates from that furtive analytical imagination.
In this way we begin by setting the parameters of the population of intercepts.
a_bar <- 1.5
sigma <- 1.5
n_markets <- 50
# customers per market
N_i <- as.integer(rep(c(5, 10, 25, 35, 40), each = 10))
With these parameters we can now simulate the average log-odds of transacting for the good (or service) for each of the markets by using, for example, a univariate normal random number.
avg_lod_odds_per_market <- rnorm(n_markets, mean = a_bar, sd = sigma)
Very Gaussian of us indeed. Perhaps we should try a Pareto power distribution with thresholds against which customers might transact, or not. But again we leave this idea, this thought experiment, for another time.
We now have the following simulation of data about the markets and customers within markets. Their only claim to fame is their transactions.
data_simulation <- data.frame(
market = 1:n_markets,
N_i = N_i,
true_log_odds = avg_lod_odds_per_market)
glimpse(data_simulation)
## Rows: 50
## Columns: 3
## $ market <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ N_i <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 10, 10, 10, 10, 10, 10,...
## $ true_log_odds <dbl> 3.5564377, 0.6529527, 2.0446926, 2.4492939, 2.1064025...
The prefix true_ refers to the data we name, to the best of our ability, as the actual, true data: the imposed objectivity of this simulation.
We transform the log-odds, through the logistic function, into the probability parameter of the binomial distribution. With this prestidigitation we can generate the number of customers who transacted, by market.
In modeling parlance, we know that the logistic is the inverse of the logit. We thus transform log-odds into probability.
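As a quick sanity check that the logistic really inverts the logit, here is a self-contained snippet defining both functions locally rather than relying on the versions a package might provide:

```r
logit_ours    <- function(p) log(p / (1 - p))  # probability -> log-odds
logistic_ours <- function(z) 1 / (1 + exp(-z)) # log-odds -> probability

p <- c(0.1, 0.5, 0.9)
all.equal(logistic_ours(logit_ours(p)), p)  # TRUE: the two are inverses
logistic_ours(0)                            # 0.5: zero log-odds is a coin flip
```

So a positive average log-odds for a market maps to a transacting probability above one half, and vice versa.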
data_simulation <- data_simulation %>%
mutate(
number_transacted = rbinom(n_markets,
prob = logistic(true_log_odds),
size = N_i)
)
glimpse(data_simulation)
## Rows: 50
## Columns: 4
## $ market <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
## $ N_i <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 10, 10, 10, 10, 10,...
## $ true_log_odds <dbl> 3.5564377, 0.6529527, 2.0446926, 2.4492939, 2.106...
## $ number_transacted <int> 5, 4, 5, 5, 3, 2, 5, 3, 5, 5, 10, 10, 5, 8, 8, 9,...
plt <- data_simulation %>%
ggplot( aes(market, number_transacted, color = N_i) ) +
geom_point() +
scale_color_viridis_c() +
labs( title = "Simulated customers who transacted per market",
color = "Initial #" )
ggplotly( plt )
Does this look reasonable? Yes as we eyeball the various market groupings of transacting customers. More customers transact in the upper markets than in the lower markets. We will be able to answer one of our questions about pooling and shrinkage: does market sample size matter?
We have three possibilities.
Do not pool at all. This is a model that has complete amnesia. It will not convey information across parameters.
Partially pool. Allow some pooling up to a point of perhaps overloading the model’s fragile memory. We will allow some fuzziness to occur. After all don’t our simulated transactors wander in and out of each other’s markets?
Pool everything. Hold on, we will know everything about everything? Does data, thinking about data with a purpose, acting on data to achieve a result live in this storied environment?
Pooling means using the information from other markets to inform our predictions of the estimated probabilities of customers transacting in different markets. No-pooling, therefore, means treating each market as completely unrelated to the other markets. This is equivalent to estimating an infinite variance for the population of parameters.
Therefore, our estimate of the probability of transacting in each market will just be the raw sample proportion of customers transacting in each market.
data_simulation <- data_simulation %>%
mutate(estimated_probability_no_pooling = number_transacted / N_i)
glimpse(data_simulation)
## Rows: 50
## Columns: 5
## $ market <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,...
## $ N_i <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 10, ...
## $ true_log_odds <dbl> 3.5564377, 0.6529527, 2.0446926, 2...
## $ number_transacted <int> 5, 4, 5, 5, 3, 2, 5, 3, 5, 5, 10, ...
## $ estimated_probability_no_pooling <dbl> 1.00, 0.80, 1.00, 1.00, 0.60, 0.40...
The lines are clear and drawn and uninformative about market relationships.
Partial pooling means modeling the population of parameters explicitly. By estimating a mean and a standard deviation for that population, we can regularize adaptively, and this action will shrink the parameter estimates that feed our predictions. Thus we fit a multilevel binomial model of transacting customers in markets.
# trimmed data for estimation in ulam() and stan
data_model <- list(
S_i = data_simulation$number_transacted,
N_i = data_simulation$N_i,
market = data_simulation$market
)
#
multilevel_model <- alist(
S_i ~ dbinom(N_i, p),
logit(p) <- a_market[market], # each market will have its own average log odds of transacting
a_market[market] ~ dnorm(a_bar, sigma),
a_bar ~ dnorm(0, 1.5),
sigma ~ dexp(1)
)
Then, we use ulam() and stan to derive the posterior distribution using the Hamiltonian Monte Carlo approach with the No-U-Turn Sampler (NUTS).
multilevel_fit <- ulam(
multilevel_model,
data = data_model,
chains = 1, log_lik = TRUE
)
##
## SAMPLING FOR MODEL '48b25e8658507514b9821a7920cf7938' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 1: Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1: Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1: Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1: Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1: Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1: Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1: Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1: Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1: Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1: Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1: Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.317 seconds (Warm-up)
## Chain 1: 0.213 seconds (Sampling)
## Chain 1: 0.53 seconds (Total)
## Chain 1:
precis( multilevel_fit )
## mean sd 5.5% 94.5% n_eff Rhat4
## a_bar 1.347924 0.2547422 0.9840563 1.766829 645.6942 0.9980424
## sigma 1.615880 0.2537554 1.2578163 2.038842 277.1777 1.0081819
For one chain, this estimation pulls the parameter values fairly close, within a standard deviation of the simulations, to the 1.5 values we chose for the data. The model so far seems compatible with the data. We might go so far as to say that the model places the mean a_bar between 0.98 and 1.77 in about 89% of the ways that are compatible with our simulated data.
We can evaluate the health of our Markov chains using the following traceplots. If we were to run more than one chain, we would likely conclude that the chains mix well across the two parameters, that they are fairly stable, and that they converge to nearly the same estimates.
traceplot_ulam(multilevel_fit)
Now, let’s find out how well the Markov chain Monte Carlo estimates converged with Rhat4, a measure of convergence. Values very close to 1 indicate that the chains converged.
precis(multilevel_fit, depth = 2) %>%
data.frame() %>%
select(Rhat4) %>%
summary()
## Rhat4
## Min. :0.9980
## 1st Qu.:0.9981
## Median :0.9984
## Mean :0.9992
## 3rd Qu.:0.9995
## Max. :1.0082
The Rhat values indicate that we sampled correctly from our posterior distributions. Let’s use these samples from the posterior distribution to calculate our estimated log-odds of transacting for each market.
posterior_samples <- extract.samples( multilevel_fit )
glimpse( posterior_samples )
## List of 3
## $ a_market: num [1:500, 1:50] 2.453 2.239 0.539 2.036 1.489 ...
## $ a_bar : num [1:500(1d)] 1.31 1.52 1.44 1.23 1.78 ...
## $ sigma : num [1:500(1d)] 1.47 1.73 1.19 1.8 2.35 ...
## - attr(*, "source")= chr "ulam posterior: 500 samples from object"
summary( posterior_samples$a_market )
## V1 V2 V3 V4
## Min. :-0.5103 Min. :-0.7299 Min. :-0.2144 Min. :-0.7858
## 1st Qu.: 1.8675 1st Qu.: 0.9432 1st Qu.: 1.8433 1st Qu.: 1.8150
## Median : 2.5627 Median : 1.4735 Median : 2.5150 Median : 2.6264
## Mean : 2.6776 Mean : 1.5207 Mean : 2.7167 Mean : 2.7077
## 3rd Qu.: 3.4570 3rd Qu.: 2.0446 3rd Qu.: 3.5337 3rd Qu.: 3.4461
## Max. : 6.9369 Max. : 4.5156 Max. : 7.1810 Max. : 7.3637
## V5 V6 V7 V8
## Min. :-1.5217 Min. :-2.57273 Min. :0.148 Min. :-2.3871
## 1st Qu.: 0.1710 1st Qu.:-0.50475 1st Qu.:1.891 1st Qu.: 0.1323
## Median : 0.6937 Median : 0.03162 Median :2.577 Median : 0.6342
## Mean : 0.7357 Mean : 0.02944 Mean :2.751 Mean : 0.7044
## 3rd Qu.: 1.2900 3rd Qu.: 0.54901 3rd Qu.:3.560 3rd Qu.: 1.2143
## Max. : 3.0729 Max. : 2.52856 Max. :7.788 Max. : 3.9193
## V9 V10 V11 V12
## Min. :-0.0194 Min. :-0.05306 Min. :0.3511 Min. :0.4774
## 1st Qu.: 1.8346 1st Qu.: 1.84965 1st Qu.:2.2744 1st Qu.:2.3733
## Median : 2.5902 Median : 2.51379 Median :2.9991 Median :2.9889
## Mean : 2.6794 Mean : 2.68319 Mean :3.1171 Mean :3.0834
## 3rd Qu.: 3.4116 3rd Qu.: 3.39019 3rd Qu.:3.7963 3rd Qu.:3.6593
## Max. : 6.1589 Max. : 6.88006 Max. :7.8415 Max. :8.2027
## V13 V14 V15 V16
## Min. :-1.4966 Min. :-0.2131 Min. :-0.2055 Min. :0.2798
## 1st Qu.:-0.2213 1st Qu.: 1.0202 1st Qu.: 0.9217 1st Qu.:1.5365
## Median : 0.2107 Median : 1.4365 Median : 1.4498 Median :2.0481
## Mean : 0.2362 Mean : 1.5113 Mean : 1.4734 Mean :2.1180
## 3rd Qu.: 0.6526 3rd Qu.: 1.9678 3rd Qu.: 1.9840 3rd Qu.:2.5873
## Max. : 2.1107 Max. : 3.6335 Max. : 3.9021 Max. :6.7987
## V17 V18 V19 V20
## Min. :0.3127 Min. :-5.46608 Min. :-4.3787 Min. :-0.2644
## 1st Qu.:2.2409 1st Qu.:-2.64172 1st Qu.:-1.8291 1st Qu.: 1.5714
## Median :2.9347 Median :-2.08895 Median :-1.3750 Median : 2.1134
## Mean :3.0002 Mean :-2.12591 Mean :-1.4255 Mean : 2.1639
## 3rd Qu.:3.5848 3rd Qu.:-1.49090 3rd Qu.:-0.9557 3rd Qu.: 2.7004
## Max. :6.8233 Max. :-0.03203 Max. : 0.5073 Max. : 5.2495
## V21 V22 V23 V24
## Min. :0.01134 Min. :-2.54460 Min. :-0.5480 Min. :0.8049
## 1st Qu.:0.93116 1st Qu.:-1.48365 1st Qu.: 0.8966 1st Qu.:2.4145
## Median :1.18143 Median :-1.16581 Median : 1.2026 Median :2.9177
## Mean :1.20967 Mean :-1.18606 Mean : 1.2185 Mean :2.9391
## 3rd Qu.:1.48541 3rd Qu.:-0.88348 3rd Qu.: 1.5257 3rd Qu.:3.4268
## Max. :2.73146 Max. : 0.05926 Max. : 3.5885 Max. :5.7931
## V25 V26 V27 V28
## Min. :1.211 Min. :-0.4323 Min. :-0.7182 Min. :-1.5583
## 1st Qu.:2.963 1st Qu.: 0.7057 1st Qu.: 0.3917 1st Qu.:-0.6010
## Median :3.552 Median : 0.9976 Median : 0.6321 Median :-0.2963
## Mean :3.647 Mean : 1.0246 Mean : 0.6612 Mean :-0.3137
## 3rd Qu.:4.274 3rd Qu.: 1.3350 3rd Qu.: 0.9198 3rd Qu.:-0.0279
## Max. :7.789 Max. : 2.2600 Max. : 2.5197 Max. : 0.7582
## V29 V30 V31 V32
## Min. :0.9073 Min. :-0.90707 Min. :0.4196 Min. :0.8413
## 1st Qu.:2.3765 1st Qu.: 0.07084 1st Qu.:1.3177 1st Qu.:1.9556
## Median :2.8098 Median : 0.33579 Median :1.5898 Median :2.3274
## Mean :2.8906 Mean : 0.33268 Mean :1.6390 Mean :2.3297
## 3rd Qu.:3.3481 3rd Qu.: 0.57211 3rd Qu.:1.9321 3rd Qu.:2.6798
## Max. :5.6674 Max. : 1.52168 Max. :3.6829 Max. :4.1580
## V33 V34 V35 V36
## Min. :1.101 Min. :-0.7895 Min. :0.7633 Min. :-2.1235
## 1st Qu.:2.246 1st Qu.:-0.0871 1st Qu.:1.4822 1st Qu.:-1.1708
## Median :2.655 Median : 0.1118 Median :1.7759 Median :-0.9426
## Mean :2.702 Mean : 0.1240 Mean :1.8081 Mean :-0.9477
## 3rd Qu.:3.076 3rd Qu.: 0.3197 3rd Qu.:2.0937 3rd Qu.:-0.6943
## Max. :5.286 Max. : 1.1699 Max. :3.3043 Max. : 0.2092
## V37 V38 V39 V40
## Min. :-0.97981 Min. :-0.3542 Min. :-2.979 Min. :0.3441
## 1st Qu.:-0.13053 1st Qu.: 0.4123 1st Qu.:-1.850 1st Qu.:1.1226
## Median : 0.10106 Median : 0.7171 Median :-1.600 Median :1.4140
## Mean : 0.09765 Mean : 0.7028 Mean :-1.597 Mean :1.4253
## 3rd Qu.: 0.31659 3rd Qu.: 0.9664 3rd Qu.:-1.292 3rd Qu.:1.6926
## Max. : 1.11785 Max. : 1.8466 Max. :-0.459 Max. :3.1198
## V41 V42 V43 V44
## Min. :1.087 Min. :0.5578 Min. :1.172 Min. :-0.71873
## 1st Qu.:2.065 1st Qu.:1.2595 1st Qu.:2.389 1st Qu.: 0.05207
## Median :2.467 Median :1.6025 Median :2.728 Median : 0.23924
## Mean :2.479 Mean :1.5876 Mean :2.786 Mean : 0.25166
## 3rd Qu.:2.829 3rd Qu.:1.8906 3rd Qu.:3.111 3rd Qu.: 0.44896
## Max. :4.310 Max. :3.1970 Max. :5.451 Max. : 1.28876
## V45 V46 V47 V48
## Min. :-1.2524 Min. :0.6458 Min. :-0.5757 Min. :1.160
## 1st Qu.:-0.5622 1st Qu.:1.4902 1st Qu.: 0.4477 1st Qu.:2.098
## Median :-0.3427 Median :1.7671 Median : 0.6749 Median :2.423
## Mean :-0.3435 Mean :1.7748 Mean : 0.6717 Mean :2.462
## 3rd Qu.:-0.1292 3rd Qu.:2.0273 3rd Qu.: 0.8764 3rd Qu.:2.769
## Max. : 0.5476 Max. :3.2006 Max. : 1.8239 Max. :4.117
## V49 V50
## Min. :0.02562 Min. :1.167
## 1st Qu.:0.90468 1st Qu.:2.758
## Median :1.11844 Median :3.253
## Mean :1.12614 Mean :3.301
## 3rd Qu.:1.33620 3rd Qu.:3.756
## Max. :2.10368 Max. :7.159
Yes, 50 markets. We asked for that!
summary( posterior_samples$a_bar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6089 1.1652 1.3467 1.3479 1.5075 2.2848
summary( posterior_samples$sigma)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.067 1.440 1.597 1.616 1.750 2.766
Before we calculate our estimated log-odds, let’s check our estimates for the population of parameters from which each intercept comes:
plt <- data.frame(alpha_bar = posterior_samples$a_bar) %>%
ggplot(aes(alpha_bar)) +
geom_histogram(binwidth = 0.01, color = "darkblue", fill = "dodgerblue4", alpha = 0.7) +
geom_vline(aes(xintercept = 1.5), linetype = 2, color = "red") +
labs(title = "Posterior samples for population intercepts (alpha)")
ggplotly( plt )
It seems that we’ve correctly captured the mean of the population. The MAP is about 1.38. What about the standard deviation of the distribution?
plt <- data.frame(sigma = posterior_samples$sigma) %>%
ggplot(aes(sigma)) +
geom_histogram(binwidth = 0.01, color = "darkblue", fill = "blue", alpha = 0.7) +
geom_vline(aes(xintercept = 1.5), linetype = 2, color = "red") +
labs(title = "Posterior samples for population standard deviation")
ggplotly( plt )
Our estimates for the variation in the population could be better. The MAP can be either 1.62 or 1.67, the vagaries of MCMC. Let’s check our estimated probability of transacting for each market.
# first our logistic function to convert scores into probabilities
logistic_ours <- function(z) {
1/(1+exp(-z))
}
# yes, a matrix of estimated intercepts by market
matrix_estimated_probs <- logistic_ours(posterior_samples$a_market)
glimpse(matrix_estimated_probs)
## num [1:500, 1:50] 0.921 0.904 0.632 0.884 0.816 ...
We have a matrix of 500 rows (500 simulations) and 50 columns (50 different markets). The column average across samples will estimate the probability of transacting in each market.
partial_pooling_estimates <- apply(matrix_estimated_probs, 2, mean)
data_simulation <- data.frame(
estimated_probability_partial_pooling = partial_pooling_estimates) %>%
cbind(data_simulation)
glimpse(data_simulation)
## Rows: 50
## Columns: 6
## $ estimated_probability_partial_pooling <dbl> 0.9007274, 0.7873951, 0.90168...
## $ market <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10...
## $ N_i <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,...
## $ true_log_odds <dbl> 3.5564377, 0.6529527, 2.04469...
## $ number_transacted <int> 5, 4, 5, 5, 3, 2, 5, 3, 5, 5,...
## $ estimated_probability_no_pooling <dbl> 1.00, 0.80, 1.00, 1.00, 0.60,...
Then we transform our true log-odds into true probabilities:
data_simulation <- data_simulation %>%
mutate(true_probabilities = inv_logit(true_log_odds)
)
glimpse(data_simulation)
## Rows: 50
## Columns: 7
## $ estimated_probability_partial_pooling <dbl> 0.9007274, 0.7873951, 0.90168...
## $ market <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10...
## $ N_i <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,...
## $ true_log_odds <dbl> 3.5564377, 0.6529527, 2.04469...
## $ number_transacted <int> 5, 4, 5, 5, 3, 2, 5, 3, 5, 5,...
## $ estimated_probability_no_pooling <dbl> 1.00, 0.80, 1.00, 1.00, 0.60,...
## $ true_probabilities <dbl> 0.97225163, 0.65767555, 0.885...
We are now ready to answer further questions about the efficacy of pooling and the resulting shrinkage.
Pooling means sharing information across, in this simulation, markets. This is done by explicitly modeling the distribution of the average log-odds ratios of transacting across markets. In this way our estimated mean for the distribution of intercepts for each market will inform each of our predictions. Let’s calculate this estimated global mean across markets.
estimated_global_mean <- data.frame(
alpha_bar = posterior_samples$a_bar) %>%
mutate(alpha_bar = inv_logit(alpha_bar)) %>%
summarise(mean(alpha_bar))
estimated_global_mean <- estimated_global_mean[1,1]
glue::glue("The estimated global mean is: {round(estimated_global_mean, 2)}")
## The estimated global mean is: 0.79
We should view our handiwork to understand how market estimates relate to the estimated global mean.
levels_market <- c("Sample size in markets: 5",
"Sample size in markets: 10",
"Sample size in markets: 25",
"Sample size in markets: 35",
"Sample size in markets: 40"
)
plt_data <- data_simulation %>%
select(market, estimated_probability_partial_pooling, estimated_probability_no_pooling, N_i) %>%
pivot_longer(-c(market, N_i), names_to = "method", values_to = "estimated_probability") %>%
mutate(N_i = glue::glue("Sample size in markets: {N_i}"),
N_i = factor(N_i, levels = levels_market))
plt <- plt_data %>%
ggplot(aes(market, estimated_probability, color = method)) +
geom_point(alpha = 0.6) +
geom_hline(aes(yintercept = estimated_global_mean), linetype = 2, color = "red") +
facet_wrap(~N_i, scales = "free") +
scale_color_viridis_d() +
scale_y_continuous(labels = scales::percent) +
theme(legend.position = "bottom") +
labs(title = "Pooling and Shrinking in a Multilevel Model",
subtitle = "Global estimated mean informs predictions for each market: shrinking estimates to the global estimated mean",
caption = "Global estimated mean shown in red.")
ggplotly( plt )
Now we can see that, with partial pooling, our estimates are informed by the estimated global mean. We shrink whatever proportion we calculate for a specific market toward this overall mean. This can be seen by zooming in on the yellow points, the estimates from partial pooling, and noticing that they are always closer to the red line than the purple points, the raw sample proportions. Pooling results in more aggressive shrinkage for the markets where we have fewer data. Will these market predictions be the ones that need to be shrunk the most?
To answer that last question we can compare how well the different models performed.
# let's recall the levels and their sample sizes
levels_market <- c("Sample size in markets: 5",
"Sample size in markets: 10",
"Sample size in markets: 25",
"Sample size in markets: 35",
"Sample size in markets: 40")
# we could use a paste0() across the sample sizes
plt_data <- data_simulation %>%
mutate(no_pooling_error = abs(estimated_probability_no_pooling - true_probabilities),
partial_pooling_error = abs(estimated_probability_partial_pooling - true_probabilities)) %>%
select(market, no_pooling_error, partial_pooling_error, N_i) %>%
pivot_longer(-c(market, N_i), names_to = "method", values_to = "error") %>%
mutate(N_i = glue::glue("Sample size in markets: {N_i}"),
N_i = factor(N_i, levels = levels_market))
plt <- plt_data %>%
ggplot(aes(error, factor(market), color = method)) +
geom_point(alpha = 0.6) +
scale_color_viridis_d() +
facet_wrap(~N_i, scales = "free_y") +
hrbrthemes::theme_ipsum_rc(grid = "Y") +
theme(legend.position = "bottom") +
labs(title = "Partial pooling vs. no-pooling: benefits of shrinkage",
subtitle = "Low sample sizes and outliers benefit",
y = "market")
ggplotly( plt )
This plot compares our estimated probability to the true probability we simulated across markets.
Some reflections are in order.
Partial pooling will shrink the number of effective variables and yield less complex models. This is most helpful for the markets where we have relatively fewer data (i.e., markets with sample size of 5 and 10). For these markets, we complement the little data that we have with the information pooled from larger markets. In this way we shrink our estimates to the population mean that we’ve estimated across all markets. The model with no pooling just uses the information in the low sample markets, resulting in overfitting that shows itself in the plot in the form of large prediction errors. The comparison between the two methods shows us how shrinkage can reduce overfitting and thus predict more reliably out of sample.
The amount of shrinkage depends on the amount of data available. When we have fewer data, we shrink a lot. When we have lots of data, we still shrink, but a lot less. For markets that have lots of data (e.g., a sample size of 35), partial pooling results in an almost identical prediction to the method with no pooling.
The model with no pooling can sometimes beat the model with partial pooling. However, on average, the model with partial pooling will often perform much better.
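That on-average claim can be checked in a toy normal-means analogue of our binomial setup, where the partial-pooling estimate has a closed precision-weighted form. This sketch assumes the population mean and spread are known, which the real model estimates:

```r
set.seed(123)
n_markets <- 2000
mu <- 1.5; tau <- 1.5   # population of market effects
sigma_obs <- 2          # within-market noise
n_obs <- 5              # small samples, where shrinkage matters most

theta <- rnorm(n_markets, mu, tau)                         # true market effects
y_bar <- rnorm(n_markets, theta, sigma_obs / sqrt(n_obs))  # raw (no-pooling) estimates

# partial pooling: precision-weighted average of the data and the population mean
w_data <- n_obs / sigma_obs^2
w_pool <- 1 / tau^2
shrunk <- (w_data * y_bar + w_pool * mu) / (w_data + w_pool)

mean((y_bar  - theta)^2)                        # no-pooling mean squared error
mean((shrunk - theta)^2)                        # partial-pooling MSE: smaller on average
mean(abs(shrunk - theta) < abs(y_bar - theta))  # per-market win rate for shrinkage
```

Individual markets can still come out worse under shrinkage, since the win rate sits below 100%, but the aggregate error is reliably smaller. That is exactly the trade partial pooling makes.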